
Introduction

This analysis concerns the very popular TV show "The Office". It covers both the scripts of the show and some additional data from the IMDb dataset. The methods used include word clouds, PCA, n-grams and some graph analysis.

I hope you enjoy it!

The Libraries

In [ ]:
#importing built-in libraries
import random
import re
from io import BytesIO

#importing requests for making HTTP requests
import requests

#importing numpy and pandas for data manipulation
import numpy as np
import pandas as pd

#importing networkx and matplotlib for creating the interactions graph
import networkx as nx
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt

#importing wikipedia to get access to Wikipedia data
import wikipedia

#importing PIL for image processing
from PIL import Image

#importing plotly and cufflinks for creating visualizations
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from plotly.offline import init_notebook_mode
import plotly.io as pio
import cufflinks as cf
cf.go_offline()

# setting default template to plotly_dark for all visualizations
pio.templates.default = "plotly_dark"
# for charts to be rendered properly
init_notebook_mode()

# importing tensorflow and tf_hub for finding similar episodes
#import tensorflow as tf
import tensorflow_hub as hub
from sklearn.decomposition import PCA
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.feature_extraction.text import CountVectorizer

#for sentiment analysis
from textblob import TextBlob

#importing wordcloud for creating word clouds
from wordcloud import WordCloud, STOPWORDS

The Data

Time to spill some beans.


In [ ]:
#read in the data
office = pd.read_csv('The-Office-Lines-V4.csv', encoding='latin-1') #transcript of the show
episodesData = pd.read_csv('the_office_series.csv') #more data about the episodes (duration, viewership, IMDB rating, etc.)
                          
#dropping the unnamed columns with NAs or no useful information, repeated index columns
office = office.drop('Unnamed: 6', axis=1)
episodesData = episodesData.drop('Unnamed: 0', axis=1)

Understanding the data

In [ ]:
episodesData.head()
Out[ ]:
   Season  EpisodeTitle   About                                               Ratings  Votes  Viewership  Duration  Date           GuestStars  Director         Writers
0  1       Pilot          The premiere episode introduces the boss and s...   7.5      4936   11.2        23        24 March 2005  NaN         Ken Kwapis       Ricky Gervais |Stephen Merchant and Greg Daniels
1  1       Diversity Day  Michael's off color remark puts a sensitivity ...   8.3      4801   6.0         23        29 March 2005  NaN         Ken Kwapis       B. J. Novak
2  1       Health Care    Michael leaves Dwight in charge of picking the...   7.8      4024   5.8         22        5 April 2005   NaN         Ken Whittingham  Paul Lieberstein
3  1       The Alliance   Just for a laugh, Jim agrees to an alliance wi...   8.1      3915   5.4         23        12 April 2005  NaN         Bryan Gordon     Michael Schur
4  1       Basketball     Michael and his staff challenge the warehouse ...   8.4      4294   5.0         23        19 April 2005  NaN         Greg Daniels     Greg Daniels

The Office IMDB dataset contains:

Season : Season Number ( 1 to 9 )

EpisodeTitle : Name of the episode

About : Description of the episode

Ratings : Ratings given to the episode on IMDb

Votes : Votes given to the episode on IMDb

Viewership : Number of viewers in US ( in millions )

Duration : Duration of the episode ( in minutes )

Date : Release date of the episode

GuestStars : Guest stars who appeared in the episode ( NaN if there were none )

Director : Director(s) of the episode

Writers : Writer(s) of the episode

The 'Transcript' dataset contains all the dialogues in the show, along with the name of the speaker and some other information.

In [ ]:
office.head()
Out[ ]:
   season  episode  title  scene  speaker  line
0  1       1        Pilot  1      Michael  All right Jim. Your quarterlies look very good...
1  1       1        Pilot  1      Jim      Oh, I told you. I couldn't close it. So...
2  1       1        Pilot  1      Michael  So you've come to the master for guidance? Is ...
3  1       1        Pilot  1      Jim      Actually, you called me in here, but yeah.
4  1       1        Pilot  1      Michael  All right. Well, let me show you how it's done.

season : season number

episode : episode number

title : episode title

scene : scene number

speaker : speaker in the scene

line : lines of the speaker

Before going any further, let's check if there are any missing values in the datasets.

In [ ]:
#checking for missing values
office.isnull().sum()
Out[ ]:
season     0
episode    0
title      0
scene      0
speaker    0
line       0
dtype: int64
In [ ]:
#checking for missing values
episodesData.isnull().sum()
Out[ ]:
Season            0
EpisodeTitle      0
About             0
Ratings           0
Votes             0
Viewership        0
Duration          0
Date              0
GuestStars      159
Director          0
Writers           0
dtype: int64

No NAs or NULLs - great success! Well, aside from GuestStars, but I'm not interested in those for this analysis. Now, let's see how many speakers there are in the series.

In [ ]:
print(office['speaker'].unique())
office['speaker'].unique().shape
['Michael' 'Jim' 'Pam' 'Dwight' 'Jan' 'Michel' 'Todd Packer' 'Phyllis'
 'Stanley' 'Oscar' 'Angela' 'Kevin' 'Ryan' 'Man' 'Roy' 'Mr. Brown' 'Toby'
 'Kelly' 'Meredith' 'Travel Agent' 'Man on Phone' 'Everybody' 'Lonny'
 'Darryl' 'Teammates' 'Michael and Dwight' 'Warehouse worker' 'Madge'
 'Worker' 'Katy' 'Guy at bar' 'Other Guy at Bar' 'Guy At Bar'
 'Pam and Jim' 'Employee' "Chili's Employee" 'Warehouse Guy'
 'Warehouse guy' 'Man in Video' 'Video' 'Actor' 'Redheaded Actress'
 "Mr. O'Malley" 'Albiny' "Pam's Mom" 'Carol' 'Bill' 'Everyone' 'Crowd'
 'song' 'Song' 'Dwight and Michael' 'Sherri' 'Creed' 'Devon' 'Children'
 'Kid' 'Ira' "Ryan's Voicemail" 'Christian' 'Hostess'
 'Michael and Christian' 'Sadiq (IT guy)' 'Mark' 'Improv Teacher'
 'Mary-Beth' 'Girl acting Pregnant' 'Actress' 'Michael and Jim'
 'Kevin & Oscar' 'All' 'Liquor Store Clerk' 'JIm' 'Bob Vance'
 'Phyllis, Meredith, Michael, Kevin' 'Captain Jack' 'Brenda'
 'Darryl and Katy' 'Jim and Pam' 'Billy Merchant' 'Doctor' 'Lab Tech'
 'Dana' "Hooter's Girls" 'Phylis' 'Gil' 'Pam and others' 'Ed' 'Packer'
 'Todd' "Jim's voicemail" 'Guy' 'Group chant' 'All the Men' 'Delivery man'
 'Craig' 'Josh' 'David' 'Dan' 'Overhead' 'Speaker' 'Jim and Dwight'
 'Melissa' 'Sasha' 'Abby' 'Jake' 'The Kids' 'Kids' 'Miss Trudy'
 'Edward R. Meow' 'Chet' 'Young Michael' 'Delivery Woman' 'Delivery Boy'
 'Office Staff' 'Store Employee' 'Pam/Jim' 'Linda' 'Hank'
 'I.D. Photographer' 'Photographer' 'Anglea' 'Female worker'
 "Billy's Girlfriend" 'Billy' 'Dealer' 'Bob' 'Andy' 'Karen'
 'Jerome Bettis' 'Ted' 'Waiter' 'Jim, Josh, and Dwight' 'Evan' 'Alan'
 'Ryan and others' 'Announcer' 'Pretzel guy' 'Cousin Mose' 'Tony' 'Server'
 'Girls' "Kelly's Mom" "Kelly's Father" 'Young Man' 'Andy and Jim'
 'Dwight ' 'M ichael' 'Michael ' 'Dwight:' 'Hannah' 'Martin' 'Male voice'
 'Michael & Dwight' 'Andy & Michael' 'Waitress' 'Chef' 'Woman at bar'
 'Cindy' 'Second Cindy' 'Other waitress' 'Andy and Michael' 'Both'
 'Harvey' 'Buyer' 'Kenny' 'Julius' 'Phone' 'Staples Guy' 'MIchael' 'Lady'
 'Paris' 'Marcy' 'Ben Franklin' 'Elizabeth' 'Priest' 'Uncle Al' 'Randy'
 'Unknown' 'Women' 'College Student' 'Business Student #1'
 'Business Student #2' 'Business Student #3' 'Woman' 'Artist' 'Rachel'
 'Dan Gore' 'Bartender' 'Student 1' 'Student 2' 'Child' 'Hunter' 'Darry'
 'Micheal' 'Chad Lite' 'Jamie' 'Barbara' 'School Official' 'Group'
 'Receptionist' 'IT Tech Guy' 'Nurse' 'Intern' 'Robert Dunder' 'Amy' 'GPS'
 'Larry Myers' 'Ex-client' 'Voice of Thomas Dean' 'sAndy' 'DunMiff/sys'
 'DwightKSchrute' 'Tech Guy' 'Angels' 'Pizza guy' 'Manager'
 'Voice #1 on phone' 'Voice #2 on phone' 'Micahel' 'Michae' 'Nick' 'Mose'
 'Co-Worker 1' 'Stanely' 'Micael' 'Vikram' 'Co-Worker 2' 'Co-Worker 3'
 'Mr. Figaro' 'Oscar and Stanley' 'Ad guy 1' 'Ad guy 2' 'David Wallace'
 'Andy, Creed, Kevin, Kelly, Darryl' 'Andy, Creed, Kevin, Kelly'
 "Michael's Ad" 'Rolando' 'Ben' 'Lester' 'Diane Kelly' 'Diane'
 'Deposition Reporter' 'Council' "Hunter's CD" 'Officer 1' 'Officer 2'
 'Officer' "Wendy's phone operator" 'Margaret' 'Coffee shop worker'
 'W.B. Jones' 'Paul Faust' 'Bill Cress' 'Paul' 'Michael/Dwight' 'Troy'
 'Girl in Club' 'Tall Girl #1' 'All Girls' 'Tall Girl #2'
 'Girl in 2nd club' 'Cleaning lady' 'Michael and Darryl' 'Phil Maguire'
 'Phil' 'Justin' 'Angela and Dwight' 'Maguire' 'Woman on mic'
 'Graphics guy' 'Holly' 'Woman over speakerphone'
 'Vance Refrigeration guy' 'Holy' 'Ronnie' 'Professor' 'Friend' 'JIM9334'
 'Receptionitis15' 'Michael & Holly' 'Dight' 'Kendall' 'Man on phone'
 'Hank ' 'Guy in audience' 'Michael and Holly'
 'Michael, Holly, and Darryl' 'Tom' 'Pete' 'Mother' 'Alex' 'Customer'
 'Stewardess' 'Beth' 'Concierge' 'Marie' 'Guy at table' 'Concierge Marie'
 'Client' 'Dacvid Walalce' 'David Wallcve' 'Dacvid Wallace' 'Leo'
 'Vance Refrigeration Guy' 'Police Officer 1' 'Police Officer 2'
 'Guy buying doll' 'Rehab Nurse' 'Everyone watching'
 'Entire Prince family' 'Prince Grandfather' 'Entire office' 'Jim '
 'Prince' 'Prince Granddaughter' 'Prince Grandmother' 'Prince Son'
 'Phyllis and Creed' 'Lawyer' 'CPR trainer' 'CPR Trainer' 'Rose'
 'Jessica Alba' 'Lily' 'Sam' 'Warehouse Michael' 'Julia' 'A.J.'
 'Phone Salesman' 'Jim, Pam, Michael and Dwight' 'Blood Drive Worker'
 'Blood Girl' 'Lynn' 'Blonde' 'Eric' 'Girl' 'Charles' 'Stephanie'
 'Employees' 'Isaac' 'Angela and Kelly' 'Supervisor' 'Michal' 'Nana'
 'Chares' 'Old Woman' 'Erin' 'Dwight and Erin' 'Dwight and Andy'
 'Michael, Pam & Ryan' 'Secretary' 'Automated phone voice' 'Mr. Schofield'
 'Financial Guy' 'Ty' 'Jessica' 'Vance Refrigeration Guy 1'
 'Vance Refrigeration Guy 2' 'VRG 1' 'VRG 2' 'Rolph' 'AJ'
 'Man from Buffalo' 'Woman from Buffalo' 'Dwight & Andy' 'Female Intern'
 'Female intern' 'Maurie' 'Megan' 'Gwenneth' 'Front Desk Clerk'
 'Mr. Halpert' 'Mema' 'Mr. Beesly' 'Little Girl' 'Penny' 'Isabel'
 'Hotel Employee' 'Hotel Manager' "Pam's mom" 'Tom Halpert' 'Pete Halpert'
 'Tom and Pete' "Pam's dad" 'Grotti' 'Andy and Dwight' 'Credit card rep'
 'Rep' 'Various' 'Keena Gifford' 'Helene' "David Wallace's Secretary"
 'Voice on CD player' 'Limo Driver' 'Jim & Pam' 'Laurie' 'Registrar'
 'Security' 'Woman in line' 'Man in line' 'Shareholder'
 'Female Shareholder' 'Second Shareholder' 'Third Shareholder'
 'Fourth Shareholder' "O'Keefe" 'Mikela' 'Students' 'Teacher' 'Lefevre'
 'Zion' 'Deliveryman' 'Michael and Erin' 'Daryl' 'Office' 'Kelly and Erin'
 'Matt' 'Computron' 'Fake Stanley' 'Gabe' 'Andy & Erin' 'Christian Slater'
 'Jo Bennett' 'Jo' 'Jerry' 'Teddy Wallace' 'Mrs. Wallace' 'Teddy'
 'Dwight, Jim and Michael' 'Policeman' 'Hospital employee'
 "(Pam's mom) Heleen" 'Kathy' 'Dale' 'Clark' ' Jim' 'Isabelle' 'D'
 'Warehouse guy 1' 'Warehouse guy 2' 'Reid' 'Night cleaning crew'
 'Miichael' 'Dwight: ' 'Michael: ' 'Jim: ' 'Meredith: ' 'Angela: '
 'Creed: ' 'Phyllis: ' 'Everyone: ' 'Oscar: ' 'Stanley: ' 'Matt: '
 'Warehouse Guy: ' 'Darryl: ' 'Andy: ' 'Pam: ' 'Erin: ' 'Kevin: '
 'Julie: ' 'Isabel: ' 'Hide: ' 'Ryan: ' 'Kelly: ' 'Bar Manager: '
 'Bouncer: ' 'Girl at table: ' 'Cookie Monster' 'Dwight.'
 "Hayworth's waiter" "Oscar's voice from the computer" 'Donna' 'Mihael'
 'Hide' 'Old lady' 'Glen' 'Gym Instructor' 'Gym instructor'
 'Dwight and Angela' 'Shane' 'Reporter' 'Realtor' 'Luke'
 'Window treatment guy' 'Angel' 'Salesman' 'Usher' 'Shelby' 'Sweeney Todd'
 'Son' 'Nate' 'Employees except Dwight' 'Astrid' 'Carroll' 'Carrol'
 'Danny' 'Steve' 'Darryl and Andy' 'Church congregation' 'Pastor'
 ' Pastor' 'Female church member' 'Male church member' 'Doug' 'Mee-Maw'
 'MeeMaw' 'Carla' "Jim's Dad" 'Bus driver' 'Michael and Andy'
 'Another guy' 'Radio' 'TV' 'Meridith' 'Robotic Voice' 'Ryan and Michael'
 'Phyliss' 'Dwight & Nate' 'Passer-by' 'Pam ' 'Bass Player' 'Justine'
 'Jada' 'Robert' 'Darrly' 'Member' 'Video Michael' 'Bookstore employee'
 'DJ' 'David Brent' 'Older guy' 'Phyllis, Stanley, Dwight' 'Younger Guy'
 'Older Woman' 'Professor Powell' 'Ryan and Kelly' 'Helen' 'Attendant'
 'Hot Dog Guy' 'Cell Phone Sales Person' 'Boom Box' 'Andy and Erin'
 'Delivery' 'Samuel' 'President' 'Goldenface' 'Cherokee Jack'
 'Michael and Samuel together' "Holly's Mom" "Holly's Dad" 'Deangelo'
 'Deangelo/Michael' 'Denagelo' "Darryl's sister" 'DeAngelo' '"Jo"'
 '"Angela"' '"Jim"' '"Phyllis"' 'Together' 'Audience' 'Erin and Kelly'
 'abe' 'Rory' 'DeAgnelo' 'Jordan' 'All but Oscar' ' Jo'
 'Darryl and Angela' 'Fred Henry' 'Fred' 'Warren Buffett' 'Warren'
 'Robert California' 'Merv Bronte' 'Merv' 'Nellie Bertram' 'Nellie'
 'Finger Lakes Guy' 'Pam as "fourth-biggest client"'
 'Pam as "ninth-biggest client"' 'Tattoo Artist' 'Female Applicant'
 'Male Applicant 1' 'Male Applicant 2' 'Gideon' 'Bruce'
 'Dwight, Erin, Jim & Kevin' 'Walter' 'Ellen' 'Walter Jr' 'Andy & Walter'
 'Walter & Walter Jr' "Erin's Cell Phone" 'Bert' 'Gabe/Kelly/Toby'
 'Andy/Pam' 'Andy/Stanley' 'Val' 'Warehouse Crew' 'Cathy' 'Offscreen'
 'Curtis' 'Drummer' 'Pam and Kelly' 'Old Man' 'Andy and Darryl'
 'Darryl and Kevin' 'Park Ranger' 'Chelsea' "Chelsea's Mom" 'Archivist'
 'Narrator' 'Soldier' 'Amanda' 'Susan' 'Andy/Oscar' 'Host'
 'Queerenstein Bears' "Oscar's friend" 'Stu' 'Stonewall Host'
 'Senator Lipton' 'Ernesto' 'Cece' 'Saleswoman' 'Emergency Operator'
 'Paramedic' 'Donna Muraski' 'Wally Amos' 'Angela/Pam' 'Brandon' 'Blogger'
 'Blogger 2' 'Lady Blogger' 'Patty' 'Old Lady' 'Others' 'Elderly Woman'
 'Irene' 'Alonzo' 'Glenn' 'Kevin & Meredith' 'Lauren' 'Party guests'
 'Magician' 'Ravi' 'Robert & Creed' 'Wrangler' 'Senator' 'Vet' 'Harry'
 'Mr. Ramish' 'Calvin' 'Off-camera' 'Rafe' 'Fake Jim' 'Voicemail'
 'Nellie and Pam' 'Video Andy' 'Phyllis, Kevin & Stanley' 'HCT Member #1'
 'HCT Member #2' 'Broccoli Rob' 'Businessman #1' 'Businessman #2'
 'Businessman #3' 'HCT' 'HCT Member #3' 'White' 'Boat Guy' 'Walt Jr.'
 'Senator Liptop' 'Business partner' 'Molly' 'Colin' 'Trevor'
 'Julius Irving' 'New Instant Message' 'Suit Store Father'
 'Athlead Employee' 'Dennis' 'Wade' 'Suit Store Son'
 'Female Athlead Employee' '3rd Athlead Employee' '4th Athlead Employee'
 'Co-worker' 'Co-worker #2' 'Mr. Romanko' 'Dance Teacher' 'Ballerinas'
 'Parent in Audience' 'Parent in audience #2' 'Parent in audience #1'
 'Investor' 'Lonnie' 'Fast Food Worker' 'Drive Thru Customer' 'Brian'
 'Cameraman' 'Rolf' 'Gabor' 'Zeke' 'Melvina' 'Wolf' 'Sensei Ira' 'Frank'
 'Party Announcer' 'Party Guest' 'Party Photographer' 'Party Waiter'
 'Nail stylist 1' 'Nail stylist 2' 'Nail manager' 'Shirley'
 'Athlead Coworker' 'Roger' 'Alice' "Oscar's Computer" 'Jeb'
 'German Minister' 'Fannie' 'Henry' 'Esther' 'Aunt Shirley' 'Cameron'
 'Promo Voice' 'Ryan Howard' 'Mr. Ruger' 'Ruger Sister 1' 'Salesmen'
 'Ruger Sister 2' 'Angela & Oscar' 'Reporter #1' 'Reporter #2'
 'Mrs. Davis' 'Carla Fern' 'Director' 'Producer'
 'Bob Vance, Vance Refrigeration' 'Production Assistant' 'Sensei' 'Philip'
 'Check-in guy' 'Casey' 'Mark McGrath' 'Jim & Dwight' 'Camera Crew'
 'Phillip' 'People in line' 'Santigold' 'Aaron Rodgers' 'Clay Aiken'
 'Camera Man' 'Malcolm' 'Casey Dean' 'Seth Mayers' 'Bill Hader' 'Dakota'
 'Stripper' 'Jakey' 'Man 1' 'Woman 1' 'Woman 2' 'Man 2' 'Moderator'
 'Man 3' 'Woman 3' 'Woman 4' 'Joan' 'Minister' 'Carol Stills']
Out[ ]:
(775,)

Well, there's a lot of them - 775 to be exact. However, not all of them are unique characters: for example, "Andy & Michael" and "Andy and Michael" are counted as two speakers despite referring to the same characters. There are many more such duplicates, caused by spelling errors (e.g. "Micheal", "Michal"), stray whitespace or colons (e.g. "Dwight: ") and other quirks of how the script is written.
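One possible way to collapse some of these duplicates is sketched below. The `normalize_speaker` helper is hypothetical (it is not part of the original notebook) and only handles formatting variants: surrounding whitespace, wrapping quotes, trailing colons and "and" versus "&". Misspellings such as "M ichael" are deliberately left untouched, since fixing them would need a hand-made mapping.

```python
import re

def normalize_speaker(name: str) -> str:
    """Collapse formatting variants of a speaker label (assumed helper, not from the notebook)."""
    name = name.strip()                                   # ' Jim', 'Pam '
    if len(name) >= 2 and name[0] == '"' and name[-1] == '"':
        name = name[1:-1]                                 # '"Jo"' -> 'Jo'
    name = name.rstrip(':').strip()                       # 'Dwight: ' -> 'Dwight'
    name = re.sub(r'\s+and\s+|\s*&\s*', ' & ', name)      # 'Andy and Michael' -> 'Andy & Michael'
    return re.sub(r'\s+', ' ', name)                      # squeeze repeated spaces

# a few labels taken from the list of unique speakers above
raw = ['Andy & Michael', 'Andy and Michael', 'Dwight: ', ' Jim', '"Jo"', 'M ichael']
cleaned = [normalize_speaker(s) for s in raw]
print(cleaned)
```

In the notebook this could be applied with `office['speaker'].map(normalize_speaker)` before counting unique speakers; the resulting count would still overstate the number of real characters because of the remaining typos.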

General Analysis

This section does not concern the script yet, but is an introductory analysis of some interesting statistics concerning "The Office".

Number of episodes per season

In [ ]:
g1 = episodesData.groupby(['Season'], as_index=False).count()
g1 = g1[['Season','EpisodeTitle']]
g1.rename(columns={'EpisodeTitle':'NoOfEpisodes'}, inplace=True)

fig = px.bar(g1,x='Season',y='NoOfEpisodes', color_discrete_sequence=['green'])
fig.update_layout(title_text='Number of episodes per Season')
fig.show()

As can be seen, seasons 5 and 6 have the most episodes (26 each), while season 1 has the fewest (6), as it was the "pilot" season meant to test whether people would like the show. After learning the audience's taste, later seasons have roughly 22-24 episodes, with season 4 being an outlier at only 14 episodes.

Top 10 highest rated episodes

In [ ]:
top_10_rated = (episodesData.sort_values(by=['Ratings','Votes'],ascending=False)).iloc[:10,:]
fig = px.bar(top_10_rated,x='EpisodeTitle',y='Ratings',color_discrete_sequence=['purple'])
fig.update_layout(title_text='Top 10 highest rated episodes of all time')
fig.show()

Top 10 longest episodes

In [ ]:
top_10_long = (episodesData.sort_values(by=['Duration','Ratings'],ascending=False)).iloc[:10,:]
fig = px.bar(top_10_long,x='EpisodeTitle',y='Duration',color_discrete_sequence=['gold'])
fig.update_layout(title_text='Top 10 longest episodes of all time',template='plotly_dark')
fig.show()

Ratings for each season

In [ ]:
rats = pd.DataFrame(episodesData.groupby(['Season'])['Ratings'].mean()).reset_index()
fig = px.line(rats,x='Season',y='Ratings')
fig.update_layout(title_text='Ratings for each season',template='plotly_dark')
fig.show()

After the mixed reception of the first season and a reworking of the formula to make the series less "edgy", ratings improved in later seasons. A significant dip can be seen for season 8, which is the first season without Michael Scott.

Number of episodes written per person

In [ ]:
mlb = MultiLabelBinarizer()
writerDf = pd.DataFrame({})
writerDf['WriterList'] = episodesData['Writers'].apply(lambda x: [y.strip() for y in x.split('|')])

mlb.fit(writerDf['WriterList'])
#creating columns = the classes of the multilabelbinarizer
writerDf[mlb.classes_] = mlb.transform(writerDf['WriterList'])
writerDf.drop('WriterList',axis = 1, inplace = True)

writerEpisodes = writerDf.sum().reset_index()
writerEpisodes.columns = ['Writer', 'Number of Episodes']
writerEpisodes = writerEpisodes.sort_values(by = 'Number of Episodes')
fig = px.bar(writerEpisodes,x = 'Number of Episodes', y = 'Writer', title = 'Number of Episodes Written',
             height  = 1000, color = 'Number of Episodes', color_continuous_scale='greens', template = 'plotly_dark')
fig.show()

Mindy Kaling (a.k.a. Kelly Kapoor) wrote the largest number of episodes for "The Office" - 22 - while also being one of the main characters in the show. The second most productive writer was Paul Lieberstein (a.k.a. Toby Flenderson) with 16 episodes, and B. J. Novak (a.k.a. Ryan Howard) shares third place, at 15 episodes, with two other writers for the show who only made guest appearances in a few episodes.

Relation between number of dialogues and rating

In [ ]:
episodeDialogues = office.groupby('title')['line'].count().reset_index()
episodeDialogues = pd.merge(episodeDialogues,episodesData, left_on = 'title', right_on = 'EpisodeTitle')
fig  = px.scatter(episodeDialogues, x = 'line', y = 'Ratings', trendline = 'ols', color = episodeDialogues['Season'].astype('category'),
                 hover_name='EpisodeTitle',
                 title = 'Relation Between Number of Dialogues and Rating')
fig.show()

As the graph shows, there is some correlation between the number of dialogues and the rating within a given season. It is strongest in season 9, where R squared is roughly 0.5. Some seasons show a negative relationship (seasons 1 and 7), while others are flat (season 5).
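For readers who want the per-season numbers rather than reading them off the trendlines, here is a minimal sketch of how the slope and R squared of a simple linear fit can be computed by hand. For a one-variable least-squares fit with an intercept, R squared equals the squared Pearson correlation. The `fit_line` helper and the toy data are assumptions made up for illustration; they are not values from the dataset.

```python
from collections import defaultdict

def fit_line(xs, ys):
    """Return (slope, r_squared) of the least-squares line y = a + b*x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sxx, sxy ** 2 / (sxx * syy)   # R^2 = squared Pearson correlation

# made-up (season, number of lines, rating) rows, NOT taken from the real data
rows = [(1, 100, 7.0), (1, 200, 8.0), (1, 300, 9.0),
        (2, 100, 8.0), (2, 200, 7.0), (2, 300, 8.0)]

by_season = defaultdict(lambda: ([], []))
for season, n_lines, rating in rows:
    by_season[season][0].append(n_lines)
    by_season[season][1].append(rating)

results = {season: fit_line(xs, ys) for season, (xs, ys) in by_season.items()}
for season, (slope, r2) in sorted(results.items()):
    print(f"season {season}: slope={slope:.4f}, R^2={r2:.2f}")
```

The sign of the slope distinguishes a positive relationship from a negative one, which R squared alone cannot do. With only a handful of episodes in a season (season 1 has 6), such fits should be read with caution.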

Word Clouds!

The plain WordCloud() class could have been used on its own; however, the make_cloud function below takes both the text and the URL of a background image, which is used as a mask that shapes the word cloud.

In [ ]:
#function for grey colour of cloud
def grey_color_func(word, font_size, position, orientation, random_state=None, **kwargs):
    return "hsl(0, 0%%, %d%%)" % random.randint(60, 100)

#function that makes the cloud
def make_cloud(x, url):
    response = requests.get(url)
    response.raise_for_status() #stopping early if the image could not be downloaded
    mask = np.array(Image.open(BytesIO(response.content))) #converting image to numpy array to make mask
    cloud = WordCloud(background_color='black',
                      width=5000, height=5000, 
                      max_words=2000, max_font_size=200, 
                      min_font_size=1, mask=mask, stopwords=STOPWORDS)
    cloud.generate(x) #generating WordCloud
    
    fig, ax = plt.subplots(figsize=(15, 15))
    ax.imshow(cloud.recolor(color_func=grey_color_func, random_state=3), interpolation='bilinear') # Adding grey colour
    ax.set_axis_off()
    
    plt.show()

Before going into further analysis, let's see what the Wikipedia page of "The Office" has to say!

In [ ]:
# Looking up wikipedia pages for the TV show
wikipedia.search('The Office (US)')

# Collecting the content to create a word cloud
the_office = wikipedia.page('The Office (American TV Series)')
df_content = the_office.content

# Creating a word cloud
make_cloud(df_content, 'https://i.etsystatic.com/16438614/r/il/c31bd2/1806659071/il_fullxfull.1806659071_pn8j.jpg')
[Image: word cloud of the Wikipedia page content]

As can be seen, some of the most frequent words describing the show on this particular page are "office", "episode", "season" and "series" - which is to be expected, as the page describes the show rather than its transcripts. However, some of the characters' names can be seen too: "Michael", "Pam" and "Jim", as well as the name of the company they worked for - "Dunder Mifflin".

Most used words in the series

In [ ]:
office_all = office.copy()

make_cloud(' '.join(office_all['line']), 'https://i.ibb.co/PG0hr7Z/6.png') #joining with spaces so that words of adjacent lines don't merge
[Image: word cloud of all lines in the series]

Most used words by Michael

In [ ]:
office_filtered_m = office[office['speaker'] == 'Michael'] 

make_cloud(' '.join(office_filtered_m['line']), 'https://i.ibb.co/f2hvgtJ/7.png') #joining with spaces so that words of adjacent lines don't merge
[Image: word cloud of Michael's lines]

Most used words by Dwight

In [ ]:
office_dwight = office[office['speaker'] == 'Dwight'] 
  
make_cloud(' '.join(office_dwight['line']), 'https://i.ibb.co/wBwtp79/8.png') #joining with spaces so that words of adjacent lines don't merge
[Image: word cloud of Dwight's lines]

Most used words by Jim

In [ ]:
office_jim = office[office['speaker'] == 'Jim'] 
  
make_cloud(' '.join(office_jim['line']), 'https://i.ibb.co/wwGHd9P/9.png') #joining with spaces so that words of adjacent lines don't merge
[Image: word cloud of Jim's lines]

Most used words by Pam

In [ ]:
office_pam = office[office['speaker'] == 'Pam'] 
  
make_cloud(' '.join(office_pam['line']), 'https://i.ibb.co/ZNJpyN7/10.png') #joining with spaces so that words of adjacent lines don't merge
[Image: word cloud of Pam's lines]

Dialogues Analysis

In this section, let's start with checking who spoke the most in this show!

Top Speakers

In [ ]:
numberOfLinesSpoken = office['speaker'].value_counts().reset_index()
numberOfLinesSpoken.columns = ['Speaker','Number of Dialogues']
numberOfLinesSpoken = numberOfLinesSpoken.sort_values(by = 'Number of Dialogues', ascending = False)
fig = px.bar(numberOfLinesSpoken[:15][::-1],x = 'Number of Dialogues', y = 'Speaker', orientation = 'h', 
             title = '<b>Top 15 Speakers with the Most Lines</b>',
            color_continuous_scale=px.colors.sequential.Blugrn,color = 'Number of Dialogues',text = 'Number of Dialogues')
fig.show()

Unsurprisingly, it's Michael - the main character of the first 7 seasons. Right after him come the "Assistant (to the) Regional Manager", Dwight, and Michael's right hand, Jim. Surprisingly, however, Kevin has the 5th highest number of lines, ahead of Angela by a single line.

Characters with the biggest number of dialogues each season

In [ ]:
fig = make_subplots(rows = 3,cols = 3, 
                    subplot_titles=[f'Season {i}' for i in range(1,10)],
                   horizontal_spacing=0.1)

for i in range(3):
    for j in range(3):
        season = i*3 + j + 1
        seasonDf = office[office['season'] == season]
        speakerDialogues = seasonDf['speaker'].value_counts().reset_index()
        speakerDialogues.columns = ['Speaker','Number of Dialogues']
        speakerDialogues = speakerDialogues.sort_values(by = 'Number of Dialogues', ascending = False).iloc[:3,]
        trace = go.Bar(x = speakerDialogues['Number of Dialogues'], y = speakerDialogues['Speaker'], name = f'Season {season}', orientation = 'h')
        fig.add_trace(trace, row = i+1, col = j+1)
fig.update_layout(showlegend = False, title = '<b>Top 3 Speakers each Season</b>')
fig.show()

As it turns out, each season it is the boss who gets to speak the greatest number of lines!

Sayings Scores

In this section, I analyse some of the most popular sayings/phrases used in the show to see which characters used them the most. Let's start with some proper formatting of the lines to make the analysis more thorough.

In [ ]:
def formatLine(line):
    line = line.lower()
    line = re.sub(r'[^\w\s]','',line)
    return line

office['formatted_lines'] = office['line'].apply(lambda x:formatLine(x))

office.head()
Out[ ]:
   season  episode  title  scene  speaker  line                                               formatted_lines
0  1       1        Pilot  1      Michael  All right Jim. Your quarterlies look very good...  all right jim your quarterlies look very good ...
1  1       1        Pilot  1      Jim      Oh, I told you. I couldn't close it. So...         oh i told you i couldnt close it so
2  1       1        Pilot  1      Michael  So you've come to the master for guidance? Is ...  so youve come to the master for guidance is th...
3  1       1        Pilot  1      Jim      Actually, you called me in here, but yeah.         actually you called me in here but yeah
4  1       1        Pilot  1      Michael  All right. Well, let me show you how it's done.    all right well let me show you how its done

Firstly, one of the most important statistics - the number of "That's what she said"'s per season.

In [ ]:
def get_count(office,n):
    #counting the lines of season n that contain the phrase (punctuation is already stripped)
    season_lines = office.loc[office['season']==n, 'formatted_lines']
    return int(season_lines.str.contains('thats what she said').sum())

s = list(range(1,10)) #season numbers
sc = [get_count(office,i) for i in s] #number of jokes in each season

fig = px.bar(x=sc,y=s,color_discrete_sequence=['#7ec0ee'],
             orientation='h',labels={'sc':'# of jokes',
                                    's':'Season'})
fig.update_layout(title_text='Number of That\'s What She Said Jokes per Season',
                  xaxis_title='Number of "That\'s what she said"\'s',yaxis_title='Season')
fig.show()

Now, let's see who used this phrase the most.

In [ ]:
df = office[office['formatted_lines'].str.contains("thats what she said")]
df = df['speaker'].value_counts().reset_index()[:5]
df.columns = ['Speaker', 'Number of References']

fig = go.Figure()
for i in range(len(df)):
    trace = go.Indicator(
        mode = "number",
        value = df.iloc[i,:]['Number of References'],
        title = {"text": f"<b>{df.iloc[i,:]['Speaker']}</b>"},
        domain = {'x': [0.1*(i+1), 0.2*(i+1)], 'y': [0, 1]}
    )
    fig.add_trace(trace)
fig.update_layout(title = "<b>That's What She Said Score</b>",height = 200)
fig.show()

Unsurprisingly, it's Michael. His score is a big one (that's what she said).

In [ ]:
df = office[office['formatted_lines'].str.contains('dunder mifflin')]
df = df['speaker'].value_counts().reset_index()[:5]
df.columns = ['Speaker', 'Number of References']
fig = go.Figure()
for i in range(len(df)):
    trace = go.Indicator(
        mode = "number",
        value = df.iloc[i,:]['Number of References'],
        title = {"text": f"<b>{df.iloc[i,:]['Speaker']}</b>"},
        domain = {'x': [0.1*(i+1), 0.2*(i+1)], 'y': [0, 1]}
    )
    fig.add_trace(trace)
fig.update_layout(title = '<b>Dunder Mifflin Score</b>',height = 200)
fig.show()

Moreover, Michael mentions Dunder Mifflin the most. Surprisingly, Erin takes 5th place.

In [ ]:
df = office[office['formatted_lines'].str.contains("boss|manager")]
df = df['speaker'].value_counts().reset_index()[:5]
df.columns = ['Speaker', 'Number of References']
fig = go.Figure()
for i in range(len(df)):
    trace = go.Indicator(
        mode = "number",
        value = df.iloc[i,:]['Number of References'],
        title = {"text": f"<b>{df.iloc[i,:]['Speaker']}</b>"},
        domain = {'x': [0.1*(i+1), 0.2*(i+1)], 'y': [0, 1]},
    )
    fig.add_trace(trace)
fig.update_layout(title = "<b>Boss / Manager Score</b>",height = 200)
fig.show()

Fittingly, the four characters who use these words the most are all managers. Michael is still in the lead - so who says "Michael" the most?

In [ ]:
df = office[office['formatted_lines'].str.contains("michael")]
df = df['speaker'].value_counts().reset_index()[:5]
df.columns = ['Speaker', 'Number of References']
fig = go.Figure()
for i in range(len(df)):
    trace = go.Indicator(
        mode = "number",
        value = df.iloc[i,:]['Number of References'],
        title = {"text": f"<b>{df.iloc[i,:]['Speaker']}</b>"},
        domain = {'x': [0.1*(i+1), 0.2*(i+1)], 'y': [0, 1]},
    )
    fig.add_trace(trace)
fig.update_layout(title = "<b>Michael Score</b>",height = 200)
fig.show()

It's Dwight. Michael takes 4th place. Let's look at some other statistics:

In [ ]:
df = office[office['formatted_lines'].str.contains("sale|sales")]
df = df['speaker'].value_counts().reset_index()[:5]
df.columns = ['Speaker', 'Number of References']
fig = go.Figure()
for i in range(len(df)):
    trace = go.Indicator(
        mode = "number",
        value = df.iloc[i,:]['Number of References'],
        title = {"text": f"<b>{df.iloc[i,:]['Speaker']}</b>"},
        domain = {'x': [0.1*(i+1), 0.2*(i+1)], 'y': [0, 1]},
    )
    fig.add_trace(trace)
fig.update_layout(title = "<b>Sale Score</b>",height = 200)
fig.show()

The boss and the top salesman mention sales the most.

In [ ]:
df = office[office['formatted_lines'].str.contains("cornell")]
df = df['speaker'].value_counts().reset_index()[:5]
df.columns = ['Speaker', 'Number of References']
fig = go.Figure()
for i in range(len(df)):
    trace = go.Indicator(
        mode = "number",
        value = df.iloc[i,:]['Number of References'],
        title = {"text": f"<b>{df.iloc[i,:]['Speaker']}</b>"},
        domain = {'x': [0.1*(i+1), 0.2*(i+1)], 'y': [0, 1]},
    )
    fig.add_trace(trace)
fig.update_layout(title = "<b>Cornell Score</b>",height = 200)
fig.show()

It's clear to see who went to Cornell - Andy.

In [ ]:
df = office[office['formatted_lines'].str.contains("beet|beets")]
df = df['speaker'].value_counts().reset_index()[:5]
df.columns = ['Speaker', 'Number of References']
fig = go.Figure()
for i in range(len(df)):
    trace = go.Indicator(
        mode = "number",
        value = df.iloc[i,:]['Number of References'],
        title = {"text": f"<b>{df.iloc[i,:]['Speaker']}</b>"},
        domain = {'x': [0.1*(i+1), 0.2*(i+1)], 'y': [0, 1]},
    )
    fig.add_trace(trace)
fig.update_layout(title = "<b>Beets Score</b>",height = 200)
fig.show()

Dwight, the real beet aficionado, mentions this vegetable the most.

In [ ]:
df = office[office['formatted_lines'].str.contains("schrute")]
df = df['speaker'].value_counts().reset_index()[:5]
df.columns = ['Speaker', 'Number of References']
fig = go.Figure()
for i in range(len(df)):
    trace = go.Indicator(
        mode = "number",
        value = df.iloc[i,:]['Number of References'],
        title = {"text": f"<b>{df.iloc[i,:]['Speaker']}</b>"},
        domain = {'x': [0.1*(i+1), 0.2*(i+1)], 'y': [0, 1]},
    )
    fig.add_trace(trace)
fig.update_layout(title = "<b>Schrute Score</b>",height = 200)
fig.show()

Dwight really likes to say his last name.

In [ ]:
df = office[office['formatted_lines'].str.contains("dunder mifflin this is")]
df = df['speaker'].value_counts().reset_index()[:5]
df.columns = ['Speaker', 'Number of References']
fig = go.Figure()
for i in range(len(df)):
    trace = go.Indicator(
        mode = "number",
        value = df.iloc[i,:]['Number of References'],
        title = {"text": f"<b>{df.iloc[i,:]['Speaker']}</b>"},
        domain = {'x': [0.1*(i+1), 0.2*(i+1)], 'y': [0, 1]},
    )
    fig.add_trace(trace)
fig.update_layout(title = "<b>\"Dunder mifflin this is ...\" Score</b>",height = 200)
fig.show()

This classic phrase was used by Pam 21 times, while Erin, who was a receptionist for a much shorter time, used it 6 times.

In [ ]:
df = office[office['formatted_lines'].str.contains("ryan")]
df = df['speaker'].value_counts().reset_index()[:5]
df.columns = ['Speaker', 'Number of References']
fig = go.Figure()
for i in range(len(df)):
    trace = go.Indicator(
        mode = "number",
        value = df.iloc[i,:]['Number of References'],
        title = {"text": f"<b>{df.iloc[i,:]['Speaker']}</b>"},
        domain = {'x': [0.1*(i+1), 0.2*(i+1)], 'y': [0, 1]},
    )
    fig.add_trace(trace)
fig.update_layout(title = "<b>Ryan Score</b>",height = 200)
fig.show()

"The Temp" was mentioned the most by Michael - yet another win for Mr. Scott.

N-grams

This section concerns the visualisation of the most frequently used n-grams in the series.

Defining a function to find the top n-grams

In [ ]:
def get_top_ngram(corpus, n=1):
    # count every n-gram of length n in the corpus
    vec = CountVectorizer(ngram_range=(n, n)).fit(corpus)
    bag_of_words = vec.transform(corpus)
    # total number of occurrences of each n-gram over all lines
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx])
                  for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    # only the 10 most frequent n-grams are returned
    return words_freq[:10]
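
To make explicit what the function above counts, here is a dependency-free sketch of n-gram counting on a few made-up lines. It only approximates the defaults of `CountVectorizer` (lowercasing, tokens of two or more word characters, n-grams that never cross line boundaries). The name `count_ngrams` and the sample lines are assumptions for illustration.

```python
import re
from collections import Counter

# Tokens of two or more word characters, similar to CountVectorizer's default
TOKEN = re.compile(r"\b\w\w+\b")

def count_ngrams(lines, n):
    """Count n-grams of length n, line by line."""
    counts = Counter()
    for line in lines:
        tokens = TOKEN.findall(line.lower())
        # slide a window of length n over the tokens of this line
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

toy_lines = ["That's what she said", "What she said was funny", "She said no"]
top_bigrams_toy = count_ngrams(toy_lines, 2).most_common(2)
print(top_bigrams_toy)
```

On these three lines the most frequent bigram is "she said", followed by "what she".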

The most frequently used 1-grams

In [ ]:
# get_top_ngram returns at most 10 entries, so no further slicing is needed
top_unigrams = get_top_ngram(office['line'], 1)
words, counts = map(list, zip(*top_unigrams))
px.bar(x=counts, y=words)

The most frequently used 2-grams

In [ ]:
# get_top_ngram returns at most 10 entries, so no further slicing is needed
top_bigrams = get_top_ngram(office['line'], 2)
words, counts = map(list, zip(*top_bigrams))
px.bar(x=counts, y=words)

The most frequently used 3-grams

In [ ]:
# CountVectorizer is already imported in the libraries cell
top_trigrams = get_top_ngram(office['line'], 3)
words, counts = map(list, zip(*top_trigrams))
px.bar(x=counts, y=words)

The most frequently used 4-grams

In [ ]:
# CountVectorizer is already imported in the libraries cell
top_fourgrams = get_top_ngram(office['line'], 4)
words, counts = map(list, zip(*top_fourgrams))
px.bar(x=counts, y=words)

Sentiment Analysis

This section concerns the sentiment analysis of the script. It is conducted on the formatted_lines column defined in the Sayings Scores section.

In [ ]:
# Defining a function to get the sentiment polarity of a text
# (a score from -1, most negative, to 1, most positive; 0 is neutral)
def polarity(text):
    return TextBlob(text).sentiment.polarity

office['polarity_score'] = office['formatted_lines'].apply(polarity)

px.histogram(office, x='polarity_score')

In [ ]:
# Defining a function to classify the sentiment based on the polarity 
def sentiment(x):
    if x<0:
        return 'Negative'
    elif x==0:
        return 'Neutral'
    else:
        return 'Positive'
    
office['polarity'] = office['polarity_score'].map(sentiment)

# compute the class counts once and reuse them for both axes
polarity_counts = office['polarity'].value_counts()
px.bar(x = polarity_counts.index,
        y = polarity_counts.values,
        labels={'x':'Sentiment','y':'Count'},
        title='Sentiment Analysis of The Office Dialogues')

As can be seen, most of the dialogues have a neutral sentiment. Notably, there are far more positive than negative dialogues (17,672 vs 6,958). Now, let's see who has the highest average sentiment score!

In [ ]:
# the most positive (main) characters

main_characters = ['Michael', 'Dwight', 'Jim', 'Pam', 'Andy', 'Angela', 'Kevin', 'Oscar', 'Erin', 'Ryan', 'Jan', 'Kelly', 'Creed', 'Stanley', 'Phyllis', 'Meredith', 'Toby', 'Darryl']
main_characters_df = office[office['speaker'].isin(main_characters)]
main_characters_df = main_characters_df.groupby('speaker')['polarity_score'].mean().reset_index()
main_characters_df = main_characters_df.sort_values(by = 'polarity_score', ascending = False)

fig = px.bar(main_characters_df, x = 'polarity_score', y = 'speaker', orientation = 'h',
                title = 'Sentiment Analysis of Main Characters',
                color = 'polarity_score', color_continuous_scale = 'Viridis',
                labels = {'polarity_score':'Average Sentiment Score','speaker':'Character'})
fig.show()

Michael's dialogues appear to have the most positive sentiment in the series. He's followed by Jim, Ryan and (surprisingly) Jan. The least positive of the main characters are Meredith and (unsurprisingly) Stanley.
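
An average over few lines is noisier than an average over many, so it may help to read each character's mean next to their line count. The sketch below shows one way to compute both in a single `groupby`. The scores are made up, and the column names `mean_polarity` and `n_lines` are assumptions for illustration.

```python
import pandas as pd

# Made-up polarity scores that stand in for office['polarity_score']
toy = pd.DataFrame({
    "speaker": ["Michael", "Michael", "Michael", "Stanley", "Stanley"],
    "polarity_score": [0.5, 0.0, 0.1, -0.2, 0.0],
})

# Named aggregation: one row per speaker with the mean score and the number of lines
summary = (
    toy.groupby("speaker")["polarity_score"]
    .agg(mean_polarity="mean", n_lines="count")
    .sort_values("mean_polarity", ascending=False)
    .reset_index()
)
print(summary)
```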

Finding Similar Episodes with PCA

In [ ]:
episodeCorpus = pd.DataFrame({'Episode Number' : [], 'Full Text': [], 'Season' : []})
episodes = []
episodeTexts = []
seasons = []
for season in range(1,10):
    subSeason = office[(office['season'] == season)]
    for episodeNo, df in subSeason.groupby('episode'):
        full_text = df['formatted_lines'].values
        episodes.append(episodeNo)
        episodeTexts.append(" ".join(full_text).lower())
        seasons.append(season)
episodeCorpus['Episode Number'] = episodes
episodeCorpus['Full Text'] = episodeTexts
episodeCorpus['Season'] = seasons

module_url = "https://www.kaggle.com/models/google/universal-sentence-encoder/TensorFlow2/universal-sentence-encoder/2" 
model = hub.load(module_url)

features = model(episodeCorpus['Full Text'].values)
pca = PCA(n_components=2, random_state=42)
reduced_features = pca.fit_transform(features)

episodeTitles = episodesData['EpisodeTitle'].to_list()
episodeTitles.pop(108)
episodeTitles.pop(95)

episodeCorpus['Dimension 1'] = reduced_features[:,0]
episodeCorpus['Dimension 2'] = reduced_features[:,1]
episodeCorpus['Episode Titles'] = episodeTitles
fig = px.scatter(episodeCorpus, x = 'Dimension 1', y = 'Dimension 2', color = 'Season', hover_name='Episode Titles',
                title = '<b>Finding Similar Episodes</b>')
fig.update_traces(marker=dict(size=12))
fig.show()

This graph is extremely interesting, as it helps to find episodes that are more or less similar in terms of their content. The festival-themed episodes are clustered together on the right side of the graph and are quite different from the other episodes. S05E25 - Broke (the leftmost point) also differs markedly from all the other episodes. We can use this graph to find episodes that do not fall in the central grouping and are thus somewhat different. A few of these episodes are: Sexual Harassment, PDA, Junior Salesman, Trivia and Livin' the Dream.
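
A two-dimensional projection keeps only part of the information in the embeddings, so points that look close in the plot are not guaranteed to be close in the original space. One possible cross-check is to compare episodes by cosine similarity on the full embedding vectors. The sketch below does this with tiny made-up vectors. The titles and numbers are placeholders, not real embeddings of the show.

```python
import numpy as np

# Placeholder titles and 3-dimensional vectors; real sentence embeddings are much longer
titles = ["Episode A", "Episode B", "Episode C", "Episode D"]
embeddings = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.9, 0.1],
])

# Normalise each row to unit length so that the dot product equals cosine similarity
unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
similarity = unit @ unit.T

# Exclude each episode from being its own nearest neighbour
np.fill_diagonal(similarity, -np.inf)
nearest = similarity.argmax(axis=1)

for title, j in zip(titles, nearest):
    print(f"{title} -> {titles[j]}")
```

With the notebook's data, `embeddings` would be the array returned by the encoder and `titles` the list of episode titles.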

Visualising Interactions between Characters

In [ ]:
#creating the interaction graph

#create episode_id for comparison later
office['episode_id'] = office['season'].astype(str)+office['episode'].astype(str)

#get 20 main characters
main_characters = list(office['speaker'].value_counts().index[:20])

#create networkx object
G = nx.Graph()

#get conversation info between characters
scene_before = ""
episode_id_before = -1
for i in range(len(office)):

    #skip the first line of every scene: it has no previous line to pair with
    if scene_before != office["scene"].iloc[i] or office["episode_id"].iloc[i] != episode_id_before:
        scene_before = office.iloc[i]["scene"]
        episode_id_before = office.iloc[i]["episode_id"]
        continue

    #get the previous and the current speaker (both are in the same scene)
    c1 = office["speaker"].iloc[i-1]
    c2 = office["speaker"].iloc[i]

    #skip pairs that involve a character outside the top 20
    if c1 not in main_characters or c2 not in main_characters:
        continue

    u, v = sorted([c1, c2])
    if G.has_edge(u, v):
        #add +1 to weight if the characters speak consecutively in the same scene
        G[u][v]["weight"] += 1
    else:
        G.add_edge(u, v, weight=1)
        
def plot_fig():
    plt.figure(figsize=(25, 25))
    pos = nx.circular_layout(G)
    edges = G.edges()

    #darker colors for higher weight
    colors = [G[u][v]['weight']**0.39 for u, v in edges]

    #only draw edges between characters that had more than 10 exchanges
    weights = [G[u][v]['weight']**0.4 if G[u][v]['weight'] > 10 else 0 for u, v in edges]

    #colors (matplotlib.cm.get_cmap is deprecated since Matplotlib 3.7)
    cmap = matplotlib.colormaps['viridis_r']

    nx.draw_networkx(G, pos, width=weights, edge_color=colors,
                     node_color="green", edge_cmap=cmap, with_labels=False, alpha=0.80)

    #place the labels slightly below the nodes
    labels_pos = {name: [pos_list[0], pos_list[1]-0.04] for name, pos_list in pos.items()}
    nx.draw_networkx_labels(G, labels_pos, font_size=20, font_family="sans-serif",
                            font_color="black", font_weight='normal')

    ax = plt.gca()
    ax.margins(0.25)
    plt.axis("equal")
    plt.tight_layout()

#plotting the graph
plot_fig()

The graph shows the interactions between the main characters (the 20 with the most lines). I also included self-loops around the nodes, which represent how often a character speaks two consecutive lines. As can be seen, most interactions are between the four most recognizable characters - Michael, Jim, Pam and Dwight. Also, the self-loop is widest at the "Michael" node. Other notable interactions are between Erin and Andy, Darryl and Andy, and Dwight and Andy, as well as between Michael and Holly and between Michael and Jan.
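
The edge weights count how often two characters speak consecutive lines within the same scene. A vectorised way to get such counts is sketched below on made-up rows. It assumes that a scene number is never reused for a different scene within the same episode, and the toy data and variable names are for illustration only.

```python
from collections import Counter

import pandas as pd

# Made-up lines: two scenes in one episode and one scene in another
lines = pd.DataFrame({
    "episode_id": ["11", "11", "11", "11", "11", "12", "12"],
    "scene": [1, 1, 1, 2, 2, 1, 1],
    "speaker": ["Michael", "Jim", "Michael", "Pam", "Jim", "Dwight", "Dwight"],
})

# Previous speaker within the same episode and scene (NaN for the first line of a scene)
previous = lines.groupby(["episode_id", "scene"])["speaker"].shift(1)
pairs = pd.DataFrame({"a": previous, "b": lines["speaker"]}).dropna()

# Sort each pair so that (Jim, Michael) and (Michael, Jim) count as the same edge
weights = Counter(
    tuple(sorted(pair)) for pair in pairs.itertuples(index=False, name=None)
)
print(weights)
```

A pair such as ("Dwight", "Dwight") corresponds to a self-loop. The counts could be loaded into a graph with `G.add_weighted_edges_from((u, v, w) for (u, v), w in weights.items())`.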

I hope you enjoyed this analysis!

michael pointing fingers